{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\nNodeFlow and Sampling\n=======================================\n\n**Author**: Ziyue Huang, Da Zheng, Quan Gan, Jinjing Zhou, Zheng Zhang\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "GCN\n~~~\n\nIn an $L$-layer graph convolution network (GCN), given a graph\n$G=(V, E)$, represented as an adjacency matrix $A$, with\nnode features $H^{(0)} = X \\in \\mathbb{R}^{|V| \\times d}$, the\nhidden feature of a node $v$ in $(l+1)$-th layer\n$h_v^{(l+1)}$ depends on the features of all its neighbors in the\nprevious layer $h_u^{(l)}$:\n\n\\begin{align}z_v^{(l+1)} = \\sum_{u \\in \\mathcal{N}(v)} \\tilde{A}_{uv} h_u^{(l)} \\qquad h_v^{(l+1)} = \\sigma ( z_v^{(l+1)} W^{(l)})\\end{align}\n\nwhere $\\mathcal{N}(v)$ is the neighborhood of $v$,\n$\\tilde{A}$ could be any normalized version of $A$ such as\n$D^{-1} A$ in Kipf et al., $\\sigma(\\cdot)$ is an activation\nfunction, and $W^{(l)}$ is a trainable parameter of the\n$l$-th layer.\n\nIn the node classification task we minimize the following loss:\n\n\\begin{align}\\frac{1}{\\vert \\mathcal{V}_\\mathcal{L} \\vert} \\sum_{v \\in \\mathcal{V}_\\mathcal{L}} f(y_v, z_v^{(L)})\\end{align}\n\nwhere $y_v$ is the label of $v$, and $f(\\cdot, \\cdot)$\nis a loss function, e.g., cross entropy loss.\n\nWhile training GCN on the full graph, each node aggregates the hidden\nfeatures of its neighbors to compute its hidden feature in the next\nlayer.\n\nIn this tutorial, we will run GCN on the Reddit dataset constructed by `Hamilton et\nal. <https://arxiv.org/abs/1706.02216>`__, wherein the nodes are posts\nand edges are established if two nodes are commented by a same user. The\ntask is to predict the category that a post belongs to. This graph has\n233K nodes, 114.6M edges and 41 categories. Let's first load the Reddit graph.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nimport dgl\nimport dgl.function as fn\nfrom dgl import DGLGraph\nfrom dgl.data import RedditDataset\nimport mxnet as mx\nfrom mxnet import gluon\n\n# load dataset\ndata = RedditDataset(self_loop=True)\ntrain_nid = mx.nd.array(np.nonzero(data.train_mask)[0]).astype(np.int64)\nfeatures = mx.nd.array(data.features)\nin_feats = features.shape[1]\nlabels = mx.nd.array(data.labels)\nn_classes = data.num_labels\n\n# construct DGLGraph and prepare related data\ng = DGLGraph(data.graph, readonly=True)\ng.ndata['features'] = features"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Here we define the node UDF which has a fully-connected layer:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "class NodeUpdate(gluon.Block):\n    def __init__(self, in_feats, out_feats, activation=None):\n        super(NodeUpdate, self).__init__()\n        self.dense = gluon.nn.Dense(out_feats, in_units=in_feats)\n        self.activation = activation\n\n    def forward(self, node):\n        h = node.data['h']\n        h = self.dense(h)\n        if self.activation:\n            h = self.activation(h)\n        return {'activation': h}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In DGL, we implement GCN on the full graph with ``update_all`` in ``DGLGraph``.\nThe following code performs two-layer GCN on the Reddit graph.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# number of GCN layers\nL = 2\n# number of hidden units of a fully connected layer\nn_hidden = 64\n\nlayers = [NodeUpdate(g.ndata['features'].shape[1], n_hidden, mx.nd.relu),\n          NodeUpdate(n_hidden, n_hidden, mx.nd.relu)]\nfor layer in layers:\n    layer.initialize()\n\nh = g.ndata['features']\nfor i in range(L):\n    g.ndata['h'] = h\n    g.update_all(message_func=fn.copy_src(src='h', out='m'),\n                 reduce_func=fn.sum(msg='m', out='h'),\n                 apply_node_func=lambda node: {'h': layers[i](node)['activation']})\n    h = g.ndata.pop('h')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "NodeFlow\n~~~~~~~~~~~~~~~~~\n\nAs the graph scales up to billions of nodes or edges, training on the\nfull graph would no longer be efficient or even feasible.\n\nMini-batch training allows us to control the computation and memory\nusage within some budget. The training loss for each iteration is\n\n\\begin{align}\\frac{1}{\\vert \\tilde{\\mathcal{V}}_\\mathcal{L} \\vert} \\sum_{v \\in \\tilde{\\mathcal{V}}_\\mathcal{L}} f(y_v, z_v^{(L)})\\end{align}\n\nwhere $\\tilde{\\mathcal{V}}_\\mathcal{L}$ is a subset sampled from\nthe total labeled nodes $\\mathcal{V}_\\mathcal{L}$ uniformly at\nrandom.\n\nStemming from the labeled nodes $\\tilde{\\mathcal{V}}_\\mathcal{L}$\nin a mini-batch and tracing back to the input forms a computational\ndependency graph (a directed acyclic graph or DAG in short), which\ncaptures the computation flow of $Z^{(L)}$.\n\nIn the example below, a mini-batch to compute the hidden features of\nnode D in layer 2 requires hidden features of A, B, E, G in layer 1,\nwhich in turn requires hidden features of C, D, F in layer 0.\n\n|image0|\n\nFor that purpose, we define ``NodeFlow`` to represent this computation\nflow.\n\n``NodeFlow`` is a type of layered graph, where nodes are organized in\n$L + 1$ sequential *layers*, and edges only exist between adjacent\nlayers, forming *blocks*. We construct ``NodeFlow`` backwards, starting\nfrom the last layer with all the nodes whose hidden features are\nrequested. The set of nodes the next layer depends on forms the previous\nlayer. An edge connects a node in the previous layer to another in the\nnext layer iff the latter depends on the former. We repeat such process\nuntil all $L + 1$ layers are constructed. The feature of nodes in\neach layer, and that of edges in each block, are stored as separate\ntensors.\n\n.. raw:: html\n\n``NodeFlow`` provides ``block_compute`` for per-block computation, which\ntriggers computation and data propogation from the lower layer to the\nnext upper layer.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Neighbor Sampling\n~~~~~~~~~~~~~~~~~\n\nReal-world graphs often have nodes with large degree, meaning that a\nmoderately deep (e.g.\u00a03 layers) GCN would often depend on input features\nof the entire graph, even if the computation only depends on outputs of\na few nodes, hence its cost-ineffectiveness.\n\nSampling methods mitigate this computational problem by reducing the\nreceptive field effectively. Fig-c above shows one such example.\n\nInstead of using all the $L$-hop neighbors of a node $v$,\n`Hamilton et al. <https://arxiv.org/abs/1706.02216>`__ propose *neighbor\nsampling*, which randomly samples a few neighbors\n$\\hat{\\mathcal{N}}^{(l)}(v)$ to estimate the aggregation\n$z_v^{(l+1)}$ of its total neighbors $\\mathcal{N}(v)$ in\n$l$-th GCN layer, by an unbiased estimator\n$\\hat{z}_v^{(l+1)}$\n\n\\begin{align}\\hat{z}_v^{(l+1)} = \\frac{\\vert \\mathcal{N}(v) \\vert }{\\vert \\hat{\\mathcal{N}}^{(l)}(v) \\vert} \\sum_{u \\in \\hat{\\mathcal{N}}^{(l)}(v)} \\tilde{A}_{uv} \\hat{h}_u^{(l)} \\qquad\n   \\hat{h}_v^{(l+1)} = \\sigma ( \\hat{z}_v^{(l+1)} W^{(l)} )\\end{align}\n\nLet $D^{(l)}$ be the number of neighbors to be sampled for each\nnode at the $l$-th layer, then the receptive field size of each\nnode can be controlled under $\\prod_{i=0}^{L-1} D^{(l)}$ by\n*neighbor sampling*.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We then implement *neighbor smapling* by ``NodeFlow``:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "class GCNSampling(gluon.Block):\n    def __init__(self,\n                 in_feats,\n                 n_hidden,\n                 n_classes,\n                 n_layers,\n                 activation,\n                 dropout,\n                 **kwargs):\n        super(GCNSampling, self).__init__(**kwargs)\n        self.dropout = dropout\n        self.n_layers = n_layers\n        with self.name_scope():\n            self.layers = gluon.nn.Sequential()\n            # input layer\n            self.layers.add(NodeUpdate(in_feats, n_hidden, activation))\n            # hidden layers\n            for i in range(1, n_layers-1):\n                self.layers.add(NodeUpdate(n_hidden, n_hidden, activation))\n            # output layer\n            self.layers.add(NodeUpdate(n_hidden, n_classes))\n\n    def forward(self, nf):\n        nf.layers[0].data['activation'] = nf.layers[0].data['features']\n        for i, layer in enumerate(self.layers):\n            h = nf.layers[i].data.pop('activation')\n            if self.dropout:\n                h = mx.nd.Dropout(h, p=self.dropout)\n            nf.layers[i].data['h'] = h\n            # block_compute() computes the feature of layer i given layer\n            # i-1, with the given message, reduce, and apply functions.\n            # Here, we essentially aggregate the neighbor node features in\n            # the previous layer, and update it with the `layer` function.\n            nf.block_compute(i,\n                             fn.copy_src(src='h', out='m'),\n                             lambda node : {'h': node.mailbox['m'].mean(axis=1)},\n                             layer)\n        h = nf.layers[-1].data.pop('activation')\n        return h"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "DGL provides ``NeighborSampler`` to construct the ``NodeFlow`` for a\nmini-batch according to the computation logic of neighbor sampling.\n``NeighborSampler``\nreturns an iterator that generates a ``NodeFlow`` each time. This function\nhas many options to give users opportunities to customize the behavior\nof the neighbor sampler, including the number of neighbors to sample,\nthe number of hops to sample, etc. Please see `its API\ndocument <https://doc.dgl.ai/api/python/sampler.html>`__ for more\ndetails.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# dropout probability\ndropout = 0.2\n# batch size\nbatch_size = 1000\n# number of neighbors to sample\nnum_neighbors = 4\n# number of epochs\nnum_epochs = 1\n\n# initialize the model and cross entropy loss\nmodel = GCNSampling(in_feats, n_hidden, n_classes, L,\n                    mx.nd.relu, dropout, prefix='GCN')\nmodel.initialize()\nloss_fcn = gluon.loss.SoftmaxCELoss()\n\n# use adam optimizer\ntrainer = gluon.Trainer(model.collect_params(), 'adam',\n                        {'learning_rate': 0.03, 'wd': 0})\n\nfor epoch in range(num_epochs):\n    i = 0\n    for nf in dgl.contrib.sampling.NeighborSampler(g, batch_size,\n                                                   num_neighbors,\n                                                   neighbor_type='in',\n                                                   shuffle=True,\n                                                   num_hops=L,\n                                                   seed_nodes=train_nid):\n        # When `NodeFlow` is generated from `NeighborSampler`, it only contains\n        # the topology structure, on which there is no data attached.\n        # Users need to call `copy_from_parent` to copy specific data,\n        # such as input node features, from the original graph.\n        nf.copy_from_parent()\n        with mx.autograd.record():\n            # forward\n            pred = model(nf)\n            batch_nids = nf.layer_parent_nid(-1).astype('int64')\n            batch_labels = labels[batch_nids]\n            # cross entropy loss\n            loss = loss_fcn(pred, batch_labels)\n            loss = loss.sum() / len(batch_nids)\n        # backward\n        loss.backward()\n        # optimization\n        trainer.step(batch_size=1)\n        print(\"Epoch[{}]: loss {}\".format(epoch, loss.asscalar()))\n        i += 1\n        # We only train the model with 32 mini-batches just for demonstration.\n        if i >= 32:\n            break"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Control Variate\n~~~~~~~~~~~~~~~\n\nThe unbiased estimator $\\hat{Z}^{(\\cdot)}$ used in *neighbor\nsampling* might suffer from high variance, so it still requires a\nrelatively large number of neighbors, e.g.\u00a0\\ $D^{(0)}=25$ and\n$D^{(1)}=10$ in `Hamilton et\nal. <https://arxiv.org/abs/1706.02216>`__. With *control variate*, a\nstandard variance reduction technique widely used in Monte Carlo\nmethods, 2 neighbors for a node seems sufficient.\n\n*Control variate* method works as follows: given a random variable\n$X$ and we wish to estimate its expectation\n$\\mathbb{E} [X] = \\theta$, it finds another random variable\n$Y$ which is highly correlated with $X$ and whose\nexpectation $\\mathbb{E} [Y]$ can be easily computed. The *control\nvariate* estimator $\\tilde{X}$ is\n\n\\begin{align}\\tilde{X} = X - Y + \\mathbb{E} [Y] \\qquad \\mathbb{VAR} [\\tilde{X}] = \\mathbb{VAR} [X] + \\mathbb{VAR} [Y] - 2 \\cdot \\mathbb{COV} [X, Y]\\end{align}\n\nIf $\\mathbb{VAR} [Y] - 2\\mathbb{COV} [X, Y] < 0$, then\n$\\mathbb{VAR} [\\tilde{X}] < \\mathbb{VAR} [X]$.\n\n`Chen et al. <https://arxiv.org/abs/1710.10568>`__ proposed a *control\nvariate* based estimator used in GCN training, by using history\n$\\bar{H}^{(l)}$ of the nodes which are not sampled, the modified\nestimator $\\hat{z}_v^{(l+1)}$ is\n\n\\begin{align}\\hat{z}_v^{(l+1)} = \\frac{\\vert \\mathcal{N}(v) \\vert }{\\vert \\hat{\\mathcal{N}}^{(l)}(v) \\vert} \\sum_{u \\in \\hat{\\mathcal{N}}^{(l)}(v)} \\tilde{A}_{uv} ( \\hat{h}_u^{(l)} - \\bar{h}_u^{(l)} ) + \\sum_{u \\in \\mathcal{N}(v)} \\tilde{A}_{uv} \\bar{h}_u^{(l)} \\\\\n   \\hat{h}_v^{(l+1)} = \\sigma ( \\hat{z}_v^{(l+1)} W^{(l)} )\\end{align}\n\nThis method can also be *conceptually* implemented in DGL as shown\nbelow,\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "have_large_memory = False\n# The control-variate sampling code below needs to run on a large-memory\n# machine for the Reddit graph.\nif have_large_memory:\n    g.ndata['h_0'] = features\n    for i in range(L):\n        g.ndata['h_{}'.format(i+1)] = mx.nd.zeros((features.shape[0], n_hidden))\n    # With control-variate sampling, we only need to sample 2 neighbors to train GCN.\n    for nf in dgl.contrib.sampling.NeighborSampler(g, batch_size, expand_factor=2,\n                                                   neighbor_type='in', num_hops=L,\n                                                   seed_nodes=train_nid):\n        for i in range(nf.num_blocks):\n            # aggregate history on the original graph\n            g.pull(nf.layer_parent_nid(i+1),\n                   fn.copy_src(src='h_{}'.format(i), out='m'),\n                   lambda node: {'agg_h_{}'.format(i): node.mailbox['m'].mean(axis=1)})\n        nf.copy_from_parent()\n        h = nf.layers[0].data['features']\n        for i in range(nf.num_blocks):\n            prev_h = nf.layers[i].data['h_{}'.format(i)]\n            # compute delta_h, the difference of the current activation and the history\n            nf.layers[i].data['delta_h'] = h - prev_h\n            # refresh the old history\n            nf.layers[i].data['h_{}'.format(i)] = h.detach()\n            # aggregate the delta_h\n            nf.block_compute(i,\n                             fn.copy_src(src='delta_h', out='m'),\n                             lambda node: {'delta_h': node.data['m'].mean(axis=1)})\n            delta_h = nf.layers[i + 1].data['delta_h']\n            agg_h = nf.layers[i + 1].data['agg_h_{}'.format(i)]\n            # control variate estimator\n            nf.layers[i + 1].data['h'] = delta_h + agg_h\n            nf.apply_layer(i + 1, lambda node : {'h' : layer(node.data['h'])})\n            h = nf.layers[i + 1].data['h']\n        # update history\n        nf.copy_to_parent()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You can see full example here, `MXNet\ncode <https://github.com/dmlc/dgl/blob/master/examples/mxnet/sampling/>`__\nand `PyTorch\ncode <https://github.com/dmlc/dgl/tree/master/examples/pytorch/sampling>`__.\n\nBelow shows the performance of graph convolution network and GraphSage\nwith neighbor sampling and control variate sampling on the Reddit\ndataset. Our GraphSage with control variate sampling, when sampling one\nneighbor, can achieve over 96% test accuracy. |image1|\n\nMore APIs\n~~~~~~~~~\n\nIn fact, ``block_compute`` is one of the APIs that comes with\n``NodeFlow``, which provides flexibility to research new ideas. The\ncomputation flow underlying a DAG can be executed in one sweep, by\ncalling ``prop_flows``.\n\n``prop_flows`` accepts a list of UDFs. The code below defines node update UDFs\nfor each layer and computes a simplified version of GCN with neighbor sampling.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "apply_node_funcs = [\n    lambda node : {'h' : layers[0](node)['activation']},\n    lambda node : {'h' : layers[1](node)['activation']},\n]\nfor nf in dgl.contrib.sampling.NeighborSampler(g, batch_size, num_neighbors,\n                                               neighbor_type='in', num_hops=L,\n                                               seed_nodes=train_nid):\n    nf.copy_from_parent()\n    nf.layers[0].data['h'] = nf.layers[0].data['features']\n    nf.prop_flow(fn.copy_src(src='h', out='m'),\n                 fn.sum(msg='m', out='h'), apply_node_funcs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Internally, ``prop_flow`` triggers the computation by fusing together\nall the block computations, from the input to the top. The main\nadvantages of this API are 1) simplicity, 2) allowing more system-level\noptimization in the future.\n\n.. |image0| image:: https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/sampling/NodeFlow.png\n.. |image1| image:: https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/sampling/sampling_result.png\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.6"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}